 new test


CLIFT: Analysing Natural Distribution Shift on Question Answering Models in Clinical Domain

arXiv.org Artificial Intelligence

This paper introduces CLIFT (Clinical Shift), a new testbed for the question-answering task in the clinical domain. The testbed includes 7.5k high-quality question-answering samples to provide a diverse and reliable benchmark. We performed a comprehensive experimental study and evaluated several deep-learning QA models on the proposed testbed. Despite impressive results on the original test set, performance degrades when the models are applied to new test sets, which reveals a distribution shift. Our findings emphasize both the need and the potential for increasing the robustness of clinical-domain models under distributional shifts. The testbed offers one way to track progress in that direction. It also highlights the necessity of adopting evaluation metrics that consider robustness to natural distribution shifts. We plan to expand the corpus by adding more samples and model results. The full paper and the updated benchmark are available at github.com/openlifescience-ai/clift


Use cases of Chi-squared test, part 1 (Machine Learning)

#artificialintelligence

Abstract: Taking the goodness of fit test (Chi test) as an example, this paper attempts to calculate the Bayesian factor BF10 of n-fold Bernoulli test by the Excel software (using JASP software as the evidence). The results showed that in the range of 0.15–0.55

Abstract: The sensitivity of gravitational wave searches is reduced by the presence of non-Gaussian noise in the detector data. These non-Gaussianities often match well with the template waveforms used in matched filter searches, and require signal-consistency tests to distinguish them from astrophysical signals. However, empirically tuning these tests for maximum efficacy is time consuming and limits the complexity of these tests. In this work we demonstrate a framework to use machine-learning techniques to automatically tune signal-consistency tests.


New Tests of Randomness and Independence for Sequences of Observations

#artificialintelligence

There is no statistical test that assesses whether a sequence of observations, time series, or residuals in a regression model, exhibits independence or not. Typically, what data scientists do is to look at auto-correlations and see whether they are close enough to zero. If the data follows a Gaussian distribution, then absence of auto-correlations implies independence. Here however, we are dealing with non-Gaussian observations. The setting is similar to testing whether a pseudo-random number generator is random enough, or whether the digits of a number such as π behave in a way that looks random, even though the sequence of digits is deterministic.
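The point about non-Gaussian data can be illustrated with a small sketch. It is not the test proposed in the article. It is only an assumed toy construction: the series x_t = z_t * z_(t-1), with z drawn uniformly from [-1, 1], has auto-correlations near zero yet is clearly not independent, which shows up in the auto-correlation of its squares. The function name `autocorrelation` and the series itself are chosen here for illustration.

```python
import random

def autocorrelation(xs, lag):
    """Sample auto-correlation of the sequence xs at a positive lag."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs)
    cov = sum((xs[i] - mean) * (xs[i + lag] - mean) for i in range(n - lag))
    return cov / var

# Assumed toy series: consecutive terms share one factor, so they are
# dependent, but the shared factor averages out in the plain correlation.
rng = random.Random(0)
z = [rng.uniform(-1.0, 1.0) for _ in range(20001)]
x = [z[t] * z[t - 1] for t in range(1, len(z))]
squares = [v * v for v in x]

acf_x = autocorrelation(x, 1)         # close to zero
acf_sq = autocorrelation(squares, 1)  # clearly positive: dependence

print("lag-1 auto-correlation of x:  ", round(acf_x, 3))
print("lag-1 auto-correlation of x^2:", round(acf_sq, 3))
```

Looking only at the first number would suggest independence; the second shows why near-zero auto-correlations are not enough outside the Gaussian case.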


A Fast and Effective Large-Scale Two-Sample Test Based on Kernels

arXiv.org Machine Learning

Kernel two-sample tests have been widely used and the development of efficient methods for high-dimensional large-scale data is gaining more and more attention as we are entering the big data era. However, existing methods, such as the maximum mean discrepancy (MMD) and recently proposed kernel-based tests for large-scale data, are computationally intensive to implement and/or ineffective for some common alternatives for high-dimensional data. In this paper, we propose a new test that exhibits high power for a wide range of alternatives. Moreover, the new test is more robust to high dimensions than existing methods and does not require optimization procedures for the choice of kernel bandwidth and other parameters by data splitting. Numerical studies show that the new approach performs well in both synthetic and real world data.
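For orientation, the baseline that the abstract mentions, the maximum mean discrepancy, can be sketched in a few lines. This is not the new test proposed in the paper. It is a minimal, quadratic-time unbiased estimate of squared MMD with a Gaussian kernel, and the bandwidth of 1.0, the sample sizes, and the function names are assumptions made for illustration.

```python
import math
import random

def gaussian_kernel(x, y, bandwidth):
    """Gaussian (RBF) kernel between two points given as tuples."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))

def mmd2_unbiased(xs, ys, bandwidth=1.0):
    """Unbiased estimate of squared MMD between samples xs and ys."""
    m, n = len(xs), len(ys)
    # Within-sample terms exclude the diagonal (i == j).
    kxx = sum(gaussian_kernel(xs[i], xs[j], bandwidth)
              for i in range(m) for j in range(m) if i != j) / (m * (m - 1))
    kyy = sum(gaussian_kernel(ys[i], ys[j], bandwidth)
              for i in range(n) for j in range(n) if i != j) / (n * (n - 1))
    kxy = sum(gaussian_kernel(x, y, bandwidth)
              for x in xs for y in ys) / (m * n)
    return kxx + kyy - 2.0 * kxy

# Assumed synthetic data: one-dimensional Gaussians with different means.
rng = random.Random(0)
a = [(rng.gauss(0.0, 1.0),) for _ in range(150)]
b = [(rng.gauss(0.0, 1.0),) for _ in range(150)]  # same distribution as a
c = [(rng.gauss(2.0, 1.0),) for _ in range(150)]  # shifted mean

mmd_same = mmd2_unbiased(a, b)
mmd_diff = mmd2_unbiased(a, c)
print("same distribution:   ", round(mmd_same, 3))
print("shifted distribution:", round(mmd_diff, 3))
```

The double loops make the cost grow with the square of the sample size, which is the kind of computational burden on large-scale data that the abstract refers to. A full test would also need a null distribution, for example from permutations, which is omitted here.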


FDA authorizes new test to detect past Covid-19 infections

#artificialintelligence

The Food and Drug Administration on Friday issued an emergency authorization for a new test to detect Covid-19 infections -- one that stands apart from the hundreds already authorized. Unlike tests that detect bits of SARS-CoV-2 or antibodies to it, the new test, called T-Detect COVID, looks for signals of past infections in the body's adaptive immune system -- in particular, the T cells that help the body remember what its viral enemies look like. Developed by Seattle-based Adaptive Biotechnologies, it is the first test of its kind. Adaptive's approach involves mapping antigens to their matching receptors on the surface of T cells. They and other researchers had already shown that the cast of T cells floating around in an individual's blood reflects the diseases they've encountered, in many cases years later.


Validating Label Consistency in NER Data Annotation

arXiv.org Artificial Intelligence

Data annotation plays a crucial role in ensuring your named entity recognition (NER) projects are trained with the right information to learn from. Producing the most accurate labels is a challenge due to the complexity involved with annotation. Label inconsistency between multiple subsets of data annotation (e.g., training set and test set, or multiple training subsets) is an indicator of label mistakes. In this work, we present an empirical method to explore the relationship between label (in-)consistency and NER model performance. It can be used to validate the label consistency (or catch the inconsistency) in multiple sets of NER data annotation. In experiments, our method identified the label inconsistency of test data in the SCIERC and CoNLL03 datasets (with 26.7% and 5.4% label mistakes). It validated the consistency in the corrected version of both datasets.


Google's TensorFlow gets a new test for training data leaks

#artificialintelligence

Google late last month debuted experimental tests for its TensorFlow Privacy library designed to reduce the degree to which machine learning models leak identifiable personal information in training data sets, such as for biometric facial recognition. The test module enables developers to "assess the privacy properties of their classification models," according to Google. The testing tool is known as a membership inference attack. Obvious applications for the technique include facial recognition and health care. This amounts to a second try for TensorFlow Privacy, which was introduced last year to address the "emerging topic" of privacy in machine learning, Google said.
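The idea behind a membership inference attack can be shown with a generic sketch. It does not use or reproduce the TensorFlow Privacy API. It assumes a simple loss-threshold attack: the attacker guesses that an example was in the training set when the model's loss on it is low. The function name and the loss values below are hypothetical.

```python
def threshold_attack_accuracy(member_losses, nonmember_losses):
    """Best balanced accuracy of the rule 'loss <= t means training member'.

    A value near 0.5 means the attacker cannot tell members from
    non-members; a value near 1.0 means the model leaks membership.
    """
    best = 0.5
    for t in sorted(set(member_losses) | set(nonmember_losses)):
        # Fraction of true members correctly flagged as members.
        tpr = sum(loss <= t for loss in member_losses) / len(member_losses)
        # Fraction of non-members correctly flagged as non-members.
        tnr = sum(loss > t for loss in nonmember_losses) / len(nonmember_losses)
        best = max(best, (tpr + tnr) / 2.0)
    return best

# Hypothetical per-example losses from an overfitted classifier:
# training examples have much lower loss than held-out examples.
leaky = threshold_attack_accuracy([0.05, 0.1, 0.2], [0.3, 0.7, 0.9])

# Hypothetical losses that look the same for both groups.
private = threshold_attack_accuracy([0.1, 0.5], [0.1, 0.5])

print("attack accuracy, overfitted model:", leaky)
print("attack accuracy, no visible gap:  ", private)
```

The gap between training loss and held-out loss is what such an attack exploits, so a score well above 0.5 is a warning sign for models trained on sensitive data.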


Do ImageNet Classifiers Generalize to ImageNet?

arXiv.org Machine Learning

We build new test sets for the CIFAR-10 and ImageNet datasets. Both benchmarks have been the focus of intense research for almost a decade, raising the danger of overfitting to excessively re-used test sets. By closely following the original dataset creation processes, we test to what extent current classification models generalize to new data. We evaluate a broad range of models and find accuracy drops of 3% - 15% on CIFAR-10 and 11% - 14% on ImageNet. However, accuracy gains on the original test sets translate to larger gains on the new test sets. Our results suggest that the accuracy drops are not caused by adaptivity, but by the models' inability to generalize to slightly "harder" images than those found in the original test sets.


Do CIFAR-10 Classifiers Generalize to CIFAR-10?

arXiv.org Machine Learning

Machine learning is currently dominated by largely experimental work focused on improvements in a few key tasks. However, the impressive accuracy numbers of the best performing models are questionable because the same test sets have been used to select these models for multiple years now. To understand the danger of overfitting, we measure the accuracy of CIFAR-10 classifiers by creating a new test set of truly unseen images. Although we ensure that the new test set is as close to the original data distribution as possible, we find a large drop in accuracy (4% to 10%) for a broad range of deep learning models. Yet more recent models with higher original accuracy show a smaller drop and better overall performance, indicating that this drop is likely not due to overfitting based on adaptivity. Instead, we view our results as evidence that current accuracy numbers are brittle and susceptible to even minute natural variations in the data distribution.


AI assistants say dumb things, and we're about to find out why

#artificialintelligence

Siri and Alexa are clearly far from perfect, but there is hope that steady progress in machine learning will turn them into articulate helpers before long. A new test, however, may help show that a fundamentally different approach is required for AI systems to actually master language. Developed by researchers at the Allen Institute for AI (AI2), a nonprofit based in Seattle, the AI2 Reasoning Challenge (ARC) will pose elementary-school-level multiple-choice science questions. Each question will require some understanding of how the world works. The project is described in a related research paper (pdf).